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Abstract 

We propose algorithms to approximate directed information graphs. Directed information 
graphs are probabilistic graphical models that depict causal dependencies between stochas¬ 
tic processes in a network. The proposed algorithms identify optimal and near-optimal 
approximations in terms of Kullback-Leibler divergence. The user-chosen sparsity trades 
off the quality of the approximation against visual conciseness and computational tractabil- 
ity. One class of approximations contains graphs with specified in-degrees. Another class 
additionally requires that the graph is connected. For both classes, we propose algorithms 
to identify the optimal approximations and also near-optimal approximations, using a novel 
relaxation of submodularity. We also propose algorithms to identify the r-best approxima¬ 
tions among these classes, enabling robust decision making. 

Keywords: probabilistic graphical models, network inference, causality, submodularity, 
approximation algorithms 


1. Introduction 

Many fields of the sciences and engineering require analysis, modeling, and decision making 
using networks, typically represented by graphs. Social networks, financial networks, and 
biological networks are a few categories that are relevant not only academically but also in 
daily life. A major challenge for studying networks is identifying a concise topology, such 
as who strongly influences whom in a social network. Real world networks are often large— 
there are tens of thousands of human genes and trillions of connections in the human brain. 
Such scales make human visual processing of and decision making with the whole network 
prohibitive. This paper investigates algorithms to identify provably good approximations of 
the network topology, which capture important system dynamics while significantly reducing 
the number of edges to enable tractable analysis. 
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Across different domains, edges are used to model various kinds of ties such as physical 
connections or dynamic relationships. For instance, in depicting a computer network, an 
edge might correspond to a physical wire or a packet exchange between a sender and a 
receiver. For the human brain, edges could correspond to information flow between different 
cells or brain regions (Takahashi et ah, 2015 Kim et al., 2014). For online social networks, 


edges might represent user-defined relationships or pairs of users that frequently message 
each other (Ver Steeg and Galstyan, 2012, 2013). Edges that represent dynamics often must 


be inferred from activity, in some cases statistically. 

There is a large literature on graphs whose edges depict statistical relationships. Markov 
and Bayesian networks are well-known probabilistic graphical models whose edges represent 
correlation. Many methods have been proposed to infer and approximate the networks from 
i.i.d. data, often relying on heuristics. For an overview, see Chapters 18 and 20 in Koller 
and Friedman (2009). For applications involving agents interacting with each other over 
time, whether in finance, biology, social networks, or other domains, there is interest in 
identifying and representing causal influences between the agents, not just correlation. 

One approach uses known families of models, such as with structural equation modeling, 


to distinguish cause and effect ( 

Zhang et al. 

2015, 2014; Chen et al. 

2012 

). A recent work 

uses belief propagation to infer directionality ( 

Chang et al., 

2014) 

Alternatively, under 


appropriate conditions such as with expert labeling and no feedback, Bayesian networks can 


depict causal relationships using Pearl’s interventional calculus (Pearl, 2009). We consider 


the general setting when such conditions or modeling assumptions might not hold. 


Recently, directed information graphs were introduced to address this issue (Quinn 


et al. 2011 Amblard and Michel 2011). Edges in directed information graphs depict 
statistical causation between non-i.i.d. time-series. In this work, “statistical causation” is 


in the sense of Granger causality (Granger, 1969), where a process X statistically causes 
Y if in sequentially predicting Y), knowledge of the past X helps in prediction even 
when Y t_1 and the past of all the other processes are already known. These graphs use 
directed information, an information theoretic quantity, which is well-defined for any class 
of stochastic processes. Directed information has been applied to a range of settings, such 


as neuroscience (Quinn et al. 


gene regulatory networks (Rao et al. 


and Galstyan 2012, 2013, Quinn et al. 


2011 


2007 


2012 ). 


Kim et al. 2011, So et al., 2012 Kim et al., 2014), 


2008), and online social networks (Ver Steeg 


For networks with thousands or millions of edges, directed information graphs become 
too complicated for direct humans analysis. A major approach to simplifying the graphs is 
to only keep a few edges which together best approximate the dynamics of the system. For 
example, a directed tree is among the simplest graphs. See Figure [l] Each node has only 
one parent. Trees have the fewest number of edges possible while being connected. There is 
a root node and a path from the root to every other node. The graph is concise, facilitating 
human analysis and decision making. A recent work proposed an efficient algorithm to 
identify the best directed tree approximation, where goodness of approximation is measured 
by Kullback-Leibler (KL) divergence from the full joint distribution to the distribution 


induced by the directed tree ( 

Quinn et al. 

2013a) 

efficient, the algorithm in 

Quinn et al. ( 

2013t 

i) only 


2013a). In addition to being computationally 


and does not require the full joint distribution to find the best approximation. 
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containing a directed tree. 
Figure 1: Diagrams for two network approximations. 


Though directed tree approximations are easy to comprehend and efficient to construct, 
they cannot depict feedback. Feedback is essential in many networks, such as in the brain 
and gene regulatory networks. Thus, for some applications, it is necessary to consider higher 
order approximations. For instance, a graph with in-degrees two and three and containing 
a directed spanning tree as a subgraph would trade-off some simplicity and computational 
efficiency in order to capture more complex relationships in the network. 


1.1 Our Contributions 

We propose an algorithm to identify the optimal connected bounded in-degree approxi¬ 
mations. The algorithm requires only low-dimensional statistics, similar to the algorithm 
for directed tree approximations. The user decides how complex to make the approxima¬ 
tions, changing the in-degrees to trade off visual and computational simplicity against the 
accuracy of the approximation. 

Identifying optimal approximations becomes prohibitive for large in-degrees. For situ¬ 
ations where a near-optimal approximation would suffice, we propose algorithms using a 
greedy search. We identify sufficient conditions, namely a relaxed form of submodularity, 
that ensure near-optimality. 

Additionally, having multiple, good approximations can aid in understanding network 
dynamics. Instead of just having the best approximation, having the five or ten best ap¬ 
proximations in order can yield insight into which edges are most important—those that 
persist in the top approximations—and those that are less significant. Being able to iden¬ 
tify the top-r approximations also enables the user to identify the best approximation of 
more restricted classes of topologies. For example, suppose that the best directed tree ap¬ 
proximation for a network had a height of six. If the user desires the best directed tree 
approximation with height less than four, he/she can look among the top-r approximations 
until finding a tree with height less than four and it would necessarily be the best such 
approximation. We develop algorithms to identify the top-r approximations with similar 
complexity as finding the optimal approximation. 

Lastly, we use simulations to validate the quality of the approximations found. 
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1.2 Related Work 


There is a large body of work on approximating Bayesian and Markov networks. One well 


known result is an algorithm to identify optimal tree approximations (Chow and Liu, 1968). 


The algorithm finds a maximum weight spanning tree using mutual information for weights 
and only requires distributions of pairs of variables. 

In general, identifying more complex approximations cannot be done in a computation¬ 
ally efficient manner. Bayesian networks are NP-hard to approximate for topologies with 


specified in-degree larger than one (Chickering 1996) and even polytrees with in-degree 


two (Dasgupta, 1999). Some works have focused on identifying optimal approximations of 
subclasses of polytrees. One work finds the best bounded in-degree approximation that pre¬ 


serves the statistical dependencies in the best tree approximation (Carvalho and Oliveira 


2007). Another work finds an optimal polytree that can be converted to a tree with a 


bounded number of edge or node deletions (Gaspers et al., 2012) 


Other approaches to approximating graphical models include using Zi-regularized regres- 


et al. 


sion to identify sparse Ising models for Markov networks with binary variables (Ravikumar 
2010). Another approach proposes a linear programming relaxation coupled with 


branch and bound to find an optimal approximation (Jaakkola et al. 2010). Annealed 


importance sampling is used in Niinimaki and Koivisto (2013); see references therein for 


Markov chain Monte Carlo based techniques. The performance of a forward-backward 
greedy search for Markov networks in a high-dimensional setting is studied in Jalali et al. 
(2011). In Pernkopf and Bilmes (2010), an algorithm is proposed to first identify an variable 


ordering and then greedily select parents. 

There has been much less work developing approximations for directed information 
graphs. In Quinn et al. (2013a), an algorithm is proposed to identify the best directed 


spanning tree approximation for directed information graphs. In Quinn et al. (2012), several 


algorithms are introduced for inferring the exact topology. One of the algorithms can be 
also used to compute the best approximation where the only topological constraints are 
user-specified in-degrees. That is discussed here as Algorithm 1 in Section |4j Several works 
investigated sparse approximations using lasso and related penalties when processes are 


jointly autoregressive with Gaussian noise (Charbonnier et al. 2010; Haufe et al. 2010 


Bolstad et al. 

2011. 

Jung et al., 

2014;; 

Basu et al. 

2015 


In our preliminary work Quinn et al. (2013b), we developed an algorithm to identify the 


optimal bounded in-degree approximation containing a directed spanning tree subgraph. 
This appears here as Algorithm 2. Also, a sufficient condition for a greedy search to re¬ 


turn near-optimal approximations was identified in Quinn et al. (2013b), presented here as 
Definition HU 

There has been research in the graphical models literature for finding the top-r solutions 
for problems such as the MAP realizations for Bayesian or Markov networks (|Nilsson 


Yanover and Weiss 2004 Fromer and Globerson, 2009; Flerova et al., 2012[ Batra et al 


1998 


2012). The present work focuses on finding the top-r solutions for structure learning. 


1.3 Paper Organization 

The paper is organized as follows. Definitions and notations are introduced in Section [2j 
Section [3] reviews directed information graphs. Section [4] presents algorithms to identify the 
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optimal bounded in-degree approximations. Section [5] identifies a sufficient condition for the 
greedy search to construct near optimal approximations. Section [6] describes an algorithm 
to find the top-r approximations. Algorithmic complexity is discussed in Section [7J The 
algorithms are empirically evaluated in Section [8j Section [9] concludes the paper. Proofs 
are in the appendix. 


2. Notation and Information-Theoretic Definitions 

We now define notation. We use for denoting. 

• For a sequence ai, a 2 , ■ ■denote (aj,..., ay) as a\ and a k := a\. Let [m] := {1, ..., m} 
and the power set 2t m l on [m] to be the set of all subsets of [m\. 

• We consider m hnite-alphabet, discrete-time random processes over a horizon n. Let X 
denote the alphabet and V (X) the space of probability measures on X. Denote the ith 
random variable at time t by W.c the ith random process as X a = (W;,i > • • • ,A^„) T , 
the whole collection of all m random processes as X = (Xi,..., X m ) T , and a subset 
of K processes indexed by A C [m] as Xa = (Xa(i)> ■ ■ •, X^/^)" 1 ". 

Remark 1 We consider the finite-alphabet setting to simplify the presentation. The 
results extend to more general cases. 


Conditional and causally conditioned distributions (Kramer, 1998) of X,; given Xy are 


n 


WPx^x*- 1 ,xv-i x i,t\ x \ l i x l) 

t= 1 

n 

(1) 

1,xtj 

t =1 

(2) 


Note the similarity between ([T|) and Q, though in ([2]) the present and future, xT, 
is not conditioned on. In Kramer (19981), the present Xjj was conditioned on in S. 
The reason we remove it will be made clear in Remark [2j 


• Consider the set of processes Xa f° r some A C [m]\{i}. Next consider two sets of 
causally conditioned distributions {-Px;||x A =x 4 £ P (X) : G Xl A l ri } and {Qx;||x t =x 4 £ 

V (X) : x 4 G Xi" 4 !”} along with a marginal distribution Tx 4 £ V (X^l™). Then the 
conditional Kullback-Leibler (KL) divergence between causally conditioned distribu¬ 
tions is given by 


D{P : 


xjx 


MQ: 


i\\^A 


XJX 


22 22 D { P x t . t \K t T'=fiT' WQx lit \x t - l =fi-') Px*-' (t 
4=1 4" 1 


t-1) 
A ) 


(3) 
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I(X i; Xi) := D(P^ Xj \\P Xt P Xj ) = D[Px i \x.\\P Ki \Px. 

n 

= ^i(x; ; x M |x*- 1 ) 

t =i 

Xi) ■■= d(px i]IXj \\p Xi \p x 

n 

=z i ( x f i - x <.‘i x t 

t =i 

I(Xj -> X i ||X A ) := ^(^Xillx^uyjll^Xillx^l^x^.j 

n 

=^ i (xr i ;x M ixr i ,x- 1 ). 


Let i,j 6 [m] and A C [m]\{L j}. The mutual information, directed information 
(Marko, 1973), and causally conditioned directed information (Kramer, 1998) are 

(4) 


I (Xj 


rt~ 1\ 


(5) 


t =l 


While mutual information quantifies statistical correlation (in the colloquial sense of 
statistical interdependence), directed information quantifies statistical causation in 
the sense of Granger causality (Quinn et al.[ 2012 Amblard and Michel, 2012). Note 
that I(Xj-; XQ = I(Xi;Xj), but I(X,- —> XQ ^ I(X* —> Xj) in general. 

Remark 2 In ([2]) and (|5|) ; there is no conditioning on the present Xj.t- This follows 
Marko’s definition (Marko, 1973) and is consistent with Granger causality Granger, 


1969). Massey 1990) and Kramer 1998) later included conditioning on X ])t for the 


specific setting of communication channels. 


3. Directed Information Graphs 


In this section, we briefly review directed information graphs (Quinn et al., 2011| Amblard 


and Michel, 2011). 


Definition 3 A directed information graph is a probabilistic graphical model where each 
node represents a process X,- and an edge X ? — > X,; is drawn if 


I(Xj X.jUXjmJY^jj) > 0. 

It follows immediately that directed information graphs are unique for a given distribution 
Px- Under certain conditions, the directed information graph corresponds to a particular 
factorization of the joint distribution. By the chain rule, the joint distribution Px factorizes 
over time as Px(x) = IIILi -fx t |x* -1 (x t |x t_1 ). If given the full past X t_1 , the processes 
{Xi,..., X m } at time t are mutually independent, Px can be further factorized as 


^x(x) = nn^- 1 ^ x )> ( e ) 

t =1 i=1 

and Px is said to be strictly causal. Equation [6] can be written using causal conditioning 
notation ([2]) as Px(x) = n™i -^XiHX^^^ ( x i || X[ m ]\{i})- A distribution Px is said to be 
positive if Px(x) > 0 f° r all x G X mn . 
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Theorem 4 Quinn et al., 2012) For a joint distribution Px, if Px is positive and strictly 


causal, then the parent sets {A(i)}' l P =1 in the directed information graph are the unique, 
minimal cardinality parent sets such that D(P X || YYfLi Px;||x Ui) ) = 0. 


A graphical separation criterion, similar to d-separation for Bayesian networks, applies 


to directed information graphs (Eichler, 2012). 


4. Optimal Bounded In-Degree Approximations 

When the exact topology is not necessary or is prohibitive to learn, approximations can be 
useful. Approximations with simple topologies facilitate visual comprehension and in some 
cases can be efficient to identify. We investigate algorithms to identify optimal approxima¬ 
tions for two settings. Goodness of the approximations is measured by the KL divergence 
between the full joint distribution and the distribution induced by the approximation. The 
researcher specifies the in-degrees, controlling the complexity. Also, the optimal approxi¬ 
mations will be identified using low dimensional statistics, not the whole joint distribution. 

We consider approximations of the form 

m 

%(x) := n P ^\\ x A(i) ( x * II x A(i))> ( 7 ) 

Z— 1 

where the A(i) C [m]\{z} are candidate parent sets and the marginal distributions {P Xi ||x 1(i) }^= i 
are exact. Let Q denote the set of such approximations. The goal is to find the P x E Q 
that minimizes the KL divergence D(P X || Px)- The following theorem characterizes an 
important decomposition property for evaluating the quality of an approximation P x . The 
approximation that minimizes the KL divergence is the one that maximizes a sum of di¬ 
rected informations from parent sets to children. 


Theorem 5 (Quinn et al., 2013a) For any distribution Px, 


m 

argminD(P x || P x ) = argmax I(X^^ —> X* )• (8) 

PxfiG Px&G i= 1 


Remark 6 In Quinn et al. (2013a), only the specific case |A(*)| 
the proof naturally extends to the general case. 


1 was considered but 


This decomposition property will be important for the following results. 


4.1 An Unconstrained Formulation 

Consider finding an optimal approximation of the form ([7]) where the only constraint is that 
the in-degrees are |A(i)| = K > 1. We assume uniform K for simplicity. The results hold 
if K is a function of i. Let Qk denote the set of all such approximations. The formula ([8]) 
simplifies. 
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Algorithm 1 . OptimalGeneral (Quinn et al. 2012) 


Input: V1 B ndlnd,K, m 


1. For i G [m\ 

2 . A(i) <- 0 

3. B A- {B : B C [m]\{t}, |R| = A} 

4. A(i) •<— argmax I(X B —> X,;) 

5. Return {A(i)}'fL l 


Corollary 7 Quinn et al., 2012) For any distribution Px, i/ie parent sets {A*(i)} r f =1 cor¬ 


responding to an optimal approximation P* G argminD(Px || Px) satisfy 


A*(i ) G argmax I(X^(j) 
A(i)-.\A(i)\=K 


X,; 


Thus, finding the optimal structure is equivalent to finding the best individual parent 
sets for each node. The process is described in Algorithm 1. A modified Algorithm 1 for 


exact structure learning was presented in Quinn et al. (2012). Algorithm 1 takes as input 


the following set of directed information values, 

T>PBndind = |l(Xs(j) -t XQ : i G [m],B(i) C [m]\{*}, \B(i)\ = A'j . 

Theorem 8 (Q uinn et al.\\20 12) Algorithm 1 returns an optimal approximation Px G Qk- 
We next consider a more specific class of graph structures. 

4.2 Finding a Connected Graph 

Algorithm 1 might return an unconnected graph. For situations where information or influ¬ 
ence propagates in the network, it can be better to work with connected structures. Directed 


trees, the simplest connected structure, were investigated in Quinn et al. (2013a). While 


visually simple and computationally easy to identify, they cannot depict complex dynam¬ 
ics such as feedback. We next consider a balance between the properties of unconstrained 
bounded in-degree approximations and directed trees. The new approximations contain a 
directed spanning tree as a subgraph and have user-specified in-degrees. See Figure |l(b)| 
Note that the root node has no parents. Remark 10 will explain how to obtain graphs where 
the root also has parents. 

Let Qk be the set of all graphs containing a spanning tree and all nodes except the root 
have in-degree K > 1. Let A(i,j) be the best set of K parents for X* that contains the 
edge Xj ->• X;, 


A(i,j)= argmax I(X 4(i) -> XQ. 

A(i):A(i)C[m]\{i},j£A(i) 


(9) 


Then assign weight I(X^ ^ -A- XQ to edge Xj -A X* in the complete graph and run a 
maximum weight directed spanning tree (MWDST) algorithm. Each edge X y -a X,; in the 
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Algorithm 2. OptimalConnected 

Input: £>Z B ndlnd, A', to _ 

1. For i G [to] 

2 . A(i) <- 0 

3. For j G [m]\{i} 

4. ^ {B : B C[m}\{i}, \B\ = K, j e B} 

5. A(i,j) -G- argmax I(X B —> X,;) 

6. {a(z)}™! <- MWDST ({I(X^ ( . X ?; )}i<^< m ) 

7. For i G [to] 

8. A(i) 4— A(i,a(i)) 

9. Return {A(i)} r jf =l 


spanning tree induces the corresponding parent set A(i,j ) for X,. This process is described 
in Algorithm 2. 

Theorem 9 Algorithm 2 returns an optimal approximation Px G Gk- 
The proof is in Appendix [A] 

Remark 10 The approximations Px G Gk have root nodes with no inward edges. Algo¬ 
rithm 2 can be modified to find the best approximation where all nodes have in-degree K and 
there is a directed spanning tree as a subgraph. Namely, create a dummy node Xo, set edge 

weights I(Xj —> Xo) < -oo and I(Xo —> Xj) < -1 for all j G [to]. Note that all the other 

edge weights are directed informations, which are KL divergences and hence non-negative. 
Then Algorithm 2 will set Xo as the root with a single outward edge. 

Algorithms 1 and 2 find optimal approximations in terms of KL divergence D(Px||-Px)- 
They only need distributions over K + 1 processes, not the full joint distribution. However, 
they compute directed informations involving I\ processes. If K is large, this could 

be computationally difficult. For some applications, instead of reducing K , it is better to 
efficiently identify near-optimal approximations. 

5. Near-Optimal Bounded In-Degree Approximations 

We next find sufficient conditions to identify near-optimal approximations in time polyno¬ 
mial in K. 

5.1 Greedy Submodularity 

Consider the following greedy procedure to select a parent set for Xj. Initially, set X,;’s 
parent set as the best individual parent Z = argmax^ I (Xj —> Xj). Then look for the 
second best parent Z 7 = argmax^ I(X ? —> Xj||Z). Repeat this K — 2 times, adding one 
parent at each iteration. 

In general, greedy methods are not provably good. We next describe sufficient conditions 
to guarantee near-optimality. 
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Definition 11 A joint distribution Px is called greedily-submodular if there exists an a > 
0, such that for any process Y and any subset X w of other processes, 

I(Xj -> Y||X 1; ..., Xj_ 2 , Xj_!) < aI(Xj_! ->■ Y||Xx,..., Xj_ 2 ), (10) 

for all 1 < j < |W| where the processes in X w are indexed according to the order in which 
they are selected by the greedy algorithm. 


This is a weaker condition than submodularity, a discrete analog of concavity (Nemhausei 


et ah, 1978). If Px has submodular directed information values, then for all pairs of pro¬ 


cesses {Xj, Y} and sets of other processes X 5 C Xg/ C X\{Xj, Y}, 


I(Xj -> Y||X 5 ,) < I (Xj -A Y||Xg). 


( 11 ) 


Submodularity implies conditioning does not increase directed information. 

Corollary 12 If Px is submodular, it is also greedily-submodular with a < 1. 

Proof Let S = {1, ..., j — 2} and S' = S L) {j — 1}, and let the processes be labeled in the 
order they are selected in a greedy search for parents for Y. Then if Px is submodular, 


I(Xj -> Y||X 1; ... ,X i _ 2 ,X J _ 1 ) < I(Xj ->■ Y||X 1} ... ,Xj_ 2 ), 

— I(Xj- 1 —t Y||X!,.... Xj_ 2 ), 


( 12 ) 

(13) 


where (12) would hold by pT] ) and (13) would hold because Xj is picked after Xj_i in a 
greedy search. Thus, © holds with a < 1. ■ 


Entropy is submodular (Fujishige 1978). However, in general mutual information and 


directed information are not, as shown in the following example. 

Example 1 Let {N,X,Z} be mutually independent, zero-mean, i.i.d. Gaussian processes. 


Let Y t+ \ = X t + Z t + N t . Then using stationarity (Cover and Thomas, 2006, pg. 256), 


I(X->Y) = I (X i; Y 2 ) = -log(\+ (N] 

2 \ var(zi) + var(A'i) 


1 , / varl 

< 2 og V 1 + ^7(Xi)y 




= l(X 1 -Y 2 \Z 1 ) = l(X^Y\\Z). 

Since conditioning can increase directed information, it is not submodular. 

Remark 13 The authors are not aware of this property being discussed in the literature 


previously. Two other conditions that are weaker than submodularity are discussed in Cevher 


and Krause (2011) and Das and Kempe \201f) . The former uses submodularity up to an 
additive error. The latter uses submodularity up to multiplicative error. Both measure 
the increase in conditioning of the terms in unlike ( [Io| ) which only bounds sequential 

increases while greedily selecting a parent set. 
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(a) The bound in Theorem 14 with L = K. 


(b) The bound in Corollary 15 with L = 2. 


Figure 2: Plots for the bounds in Theorem 14 and Corollary 15 respectively. 


Assumption 1 We assume that Px is greedily-submodular. 

When Assumption [l] holds, the greedy search yields a near-optimal approximation. Let 
A denote the set of indices for an optimal set of K parents and B the indices for the greedily 
selected set of L < K parents. 


Theorem 14 Under Assumption^ 7J 

I(X B -4 Y) > ( 1 - exp 


—L 


Ei= 


K-l 


cr 


I(Xa^Y). 


The proof is in Appendix [B] 

Recall from Theorem [5] that the larger the sum of directed information values from 
parent sets to children is, the better the approximation is. Theorem |14| implies that greedy 
approximations are near-optimal approximations. Figure 2(a) shows the bound coefficient 
in Theorem 14 for a € {1.3,1.7, 2.5}. In Example [lj if the variances were equal, a = 1.71 
would suffice. 

We can also bound how close an optimal parent set Al with in-degree L is to an optimal 
parent set Ak with in-degree I\ > L. 


Corollary 15 Under Assumption [IJ with q/1, 

a L — 1 


I(X, -4 Y) > 


a 


K-l 


I(X, 4Y). 


The proof is in Appendix [C} The bound coefficient is plotted in Figure [2(b)| 


5.2 Near-Optimal Solutions for the Unconstrained Problem 

We next consider Algorithm 3 which uses a greedy search to find a near-optimal solution to 
(|8]) where the only constraints are in-degree bounds, similar to Algorithm 1. Since the greedy 
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Algorithm 3. Near-OptimalGeneral 
Input: L, m 

1. For i £ {1,... , m} 

2 . R ( i )<-0 

3. While \B(i)\ < L 

4. For l E {1,..., m}\{B(i) |J{z}} 

5. Compute I(X; —> Xi||X B ^) 

6. B(i) <r- B(i) (J arg max I(X; -A Xj||X B(i )) 

i 

7. Return {B(i)} r fL 1 


search is adaptive, directed information values will be computed as needed. Let {B{i)} r fL 1 
and {A(z)}£L 1 denote the parent sets returned by Algorithms 3 and 1 respectively. 


Theorem 16 Under Assumption [7J 

m / 

J2 1(XB(t) ->• Xi) > | 1 — exp 


—L 

K—L 


i= 1 


E J\— l 

i=0 a 


I El(XA ( i)->Xi). 
2=1 

Proof The proof follows from Theorem 14 holding for each i £ {1, ..., m}. 


Remark 17 The edge weight I(Xs(i) —> X*) can he computed using a chain rule (Kramer , 


1998). Let {ji, j?,... ,jr} denote B(i). Then 

L 


I(X B(i) -A Xj) = £l(X, -A X/11Xj,,.. .X^J. 


Z=1 


5.3 Near-Optimal Solutions for Finding Connected Graphs 

We now propose Algorithm 4 which uses a greedy search to find a near-optimal connected 
solution to ([8]). Similar to Algorithm 3, it precomputes parentsets for each possible directed 
edge. Then a MWDST algorithm is called. In Algorithm 4, B(i,j) is the set of parents for 
Xj with Xj as one of the parents, selected in a greedy fashion. The value I(Xg^ ; —> X,;) 
is the weight of edge Xj -A X* given to the MWDST algorithm. Let {A(i)}^” 1 denote the 
parent sets returned by Algorithm 2. 

Theorem 18 Under Assumption [7J for Algorithm 4, 

exp 

The proof is in Appendix [D] 




X,) > 1 - 


2=1 


—L 


Xf=o 


X>X4 (i) ^Xj). 


2=1 
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Algorithm 4. Near-OptimalConnected 
Input: L, m 

1. For i G {1, ... , m} 

2 . £(*)<-0 

3. For j G {1,..., m}\{z} 

4- f-Jj} 

5. While |B(*,j)| < £ 

6. For i G {l,...,m}\{5(t,j)U{*}} 

7. Compute I(X/ 

8. B(i, j)^B(i,j) UargmaxI(X z X^X^.^) 

9. {6(0}^! <- MWDST ({I(X 5( . 

10. For i G {1,..., m} 

11. B(i) ^ B(i,b{i)) 

12. Return {R(i)}™ 1 


6. Best r Bounded In-Degree Approximations 


For a given Px, Algorithms 1-4 each return a single solution. For many applications, 
knowing the r-best approximations, where r might be five, twenty, or a hundred, can be 
advantageous. For instance, there is no guarantee of uniqueness for optimal approximations. 
Additionally, when data is limited or noisy, the actual optimal solution might appear as 
second or third best due to estimation errors. Also, for more complex constraints, such as 
directed trees with depth at most five, it might be easier to find the r-best solutions to the 
more general problem of directed tree approximations, and then pick the highest ranking 
solution that satisfies the extra constraint. Lastly, edges that persist in all of the r-best 
approximations are likely to be important. 

We next discuss methods to identify the r-best bounded in-degree approximations. For 
simplicity, we will focus on altering Algorithm 1 and then discuss differences for modifying 
the other algorithms. A strategy for identifying the r-best solutions for assignment problems 


is discussed in Lawler (1972) based on branching candidate solutions. The method would be 


impractical to apply here. We develop an alternative algorithm, which, like that in Lawler 


(1972), will be based on a branching of candidate solutions. 


6.1 Optimal Solutions for the Unconstrained Problem 

Recall from Theorem [5] that the sum of directed information values from parent sets to 
children corresponds to how good an approximation is. When the only constraint is a user 
specified in-degree K, the parent sets can be chosen independently, as in Algorithm 1. This 
property simplifies searching for the r-best approximations. For instance, the second best 
approximation will only differ from the first by one parent set. 

Algorithm 5 identifies the r-best bounded in-degree approximations in order. It main¬ 
tains a list of candidate approximations. It instantiates that list by calling Algorithm 1. 
Each time an approximation is selected from the list, Algorithm 6 generates new candidate 
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Algorithm 5. TopRGeneral 

Input: PX Bn dlnd, K, m, r _ 

1. Top 0 

2. Z ^— 0 

3. <S OPTIMALGENERAL(PX B ndInd, R, m) 

4. While l < r 

5 . li-l + 1 

6. Top(l) <- argmax^^i I(X A(i) -A Xj) 

PxeS 

7. 5 e- 5 U GETNEwSoLN(PX B ndind, A", m, Top(l )) 

8 . 5 5 \ Top 

9. Return Top 


Algorithm 6. GetNewSolns 


Input: Vl B ndind,K,m,{A'('i)} 1 i T 1 

1. 

5^0 


2. 

For i € {1,. 

■ •, m} 

3. 

wmi 

<- {A\i)}? = i 

4. 

B <- {B : 

B C [m]\{i}, \B\ = K, I(Xb —t Xj) < I(X 4 ' W Xj)} 

5. 

A(i) argmax I(X R -» Xj) 



B&B 

6. 

5F5U{{A(*)}^r} 

7. 

Return S 



approximations from that “seed.” Algorithm 6 finds m solutions by keeping all but one 
parent set, replacing it with the next best one. 

To simplify the presentation, we will make the following assumption to avoid ties. 

Assumption 2 For a given joint distribution F*x, for any process Xj, no two parent 
sets A(i),B(i ) C [m]\{i}, with |A(i)| = \B(i)\ have identical directed information values, 
I(X 4W -A Xj) + l(X m -> Xj). 

For cases where Assumption [2] does not hold, line 4 in Algorithm 6 can be modified to 
check not only values, but elements of parent sets as well. 

Theorem 19 Under Assumption [S| Algorithm 5 returns the r-best bounded in-degree ap¬ 
proximations. 

The proof is in Appendix [Ej We also provide a discussion for how to index approxima¬ 
tions for efficient implementation of lines 6-8 of Algorithm 5 in Appendix [Fj 


6.2 Optimal Solutions to Find Connected Graphs 


Identifying the r-best connected approximations is more complicated than the general case 
in Section 6.1 Algorithm 5 would not need to be modified, but Algorithm 6 would. Recall 


14 









Bounded Degree Approximations of Stochastic Networks 


from Algorithm 2 line 5 that each edge Xj —> Xj in the complete graph is assigned an edge 
weight iCX^jj) —> Xj). If the edge Xj -» X,; is selected by the MWDST algorithm in 

line 6 , then A(i,j) is the parent set assigned to Xj. To find the r-best connected approx¬ 
imations, as in Algorithm 6 , “seed” approximations should be modified to generate new 
candidate approximations. However, Algorithm 6 will not work properly. Some candidate 
approximations might be identical to the seed. 

To see this, let {ji, J 2 , • • •, jx} denote Suppose A(i,j i) = A(i,j 2 ) and X^ —»• Xj 

was an edge in the MWDST that induced the seed approximation. Thus, A(i,j 1 ) is the 
parent set selected for Xj. In generating new approximations, even if Xj, —> Xj is given 
a smaller weight, I(X^ ?; ^.. —> Xj), edge Xj 2 —> Xj might be selected by the MWDST 
algorithm instead, yielding the same parent set. 

One approach to generate candidate approximations involves checking whether modify¬ 
ing a single edge weight in the complete graph, such as setting Xj, —> Xj to have weight 
^(ij) —> Xj), does result in a candidate approximation different than the seed. If not, 

then all subsets of edges inducing the same parent set A(i,j 1 ) should be modified. Thus if 
A(i, j\) = A(i, j' 2 ) = A(i, j ;$), then the weights of edges {Xj, —y Xj}, {X^ —> Xi,Xj 2 —> Xj} 
{Xj, —> X,, Xj 3 —> Xj}, and {Xj, —> ~K{,Xj 2 —> —>• Xj} should be modified. Any 

resulting candidate parent sets that differ from the seed should be retained. 

6.3 Near-Optimal Solutions for the Unconstrained Problem 

Algorithm 5 generates the r-best approximations, but calls Algorithms 6 which uses an 
exhaustive search. A greedy search can be used instead to generate r approximations. 
Consider the first time that Algorithm 6 is called. Let {ji,j 2 , ■ ■ ■ ,ji<} denote the parent 
set, in order they were added, for node i. When i’s parent set is changed, in line 5 set 

B {B : {ji,.. • Ja'-i} C B C [m]\{i, j K }, \B\ = I\}. 

So only the parent added last in the greedy search is changed. 

Consider the first time that Algorithm 6 is called where i 's parents in the seed approx¬ 
imation is {. 71 ,.. -Jk-iJ'k} f° r some 3k / 3k- Then set 

B {B : {ji, ■ • • ,3 k- 1 } CBC [m]\{i,j K ,j'K}i \ B \ = K}. 

This can be repeated m — I\ — 2 more times until \B\ =0. When this occurs, the (K — l)th 
parent needs to be changed. Set 

B {B : {. 71 ,.. •, ji<- 2 , JA'-i} CBC [m)\{i,j K - 1 }, \B\ = K}. 

where 

3k- 1 = argmax I(Xj Xj||X jlv ..j K _ 2 ), 

the next (K — l)th parent selected in a greedy order. Continue in this manner until Algo¬ 
rithm 5 selects r approximations. 

Note also that we can combine the modifications discussed here with those in Section [?i~ 2 l 
to identify the top r connected approximations using a greedy search. 
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7. Complexity of Proposed Algorithms 

This section explores the computational complexity of the algorithms and storage complex¬ 
ity of the approximations. 

First, calculating I(X, Z —> Y) in general has exponential complexity. Note that 

n 

I(X,Z -> Y) = (Y t ;X*“ 1 ,Z t-1 |Y t-1 ). 

t =l 


The last term in particular, I(Y^; X n_1 , Z n ~ 1 \Y n ~ 1 ), involves a sum over all realizations of 
3n — 2 random variables. Thus, with X denoting the alphabet, the last term has complexity 
C(|X| 3n ). We will assume Markovicity of a fixed order l, so 

n 

I(X,Z —► Y) = ^I(Y;A*- 1 ,Z*- 1 |Y t t - 1 ). 

t=i 


The complexity of computing I(X, Z —y Y) then becomes C?(n|X| 3i+1 ) = 0(n). More gen¬ 
erally, computing I(X B —> Y||X S /), where \B\ + \B'\ = K, has 0(n|X|( A+1 ^ +1 ) complexity 
assuming Markovicity. 


Assumption 3 We assume Markovicity of order l. 


7.1 Algorithm 1 . OptimalGeneral 

For each process Xj, for each of the ( m ' K 1 ) possible subsets B with \B\ = K, I(X B -A Xj) is 
computed. Each computation has complexity C(n|X|^ A+1 ^ +1 ). Thus, the total complexity 
for Algorithm 1 under Assumption [3] is G(m( m ^ 1 )n|X|( A - +1 ^ +1 ), or 0(m K+1 n ) for fixed I\. 

7.2 Algorithm 2. OptimalConnected 

Algorithm 2 computes the same directed information terms as Algorithm 1. It also com¬ 


putes a MWDST, which takes 0(m 2 ) time (Edmonds, 1967). Under Assumption [3J the 
total complexity is 0(m( m J ^ 1 )n|X|^ A " +1 ^ +1 + m 2 ). If K is fixed, the complexity becomes 
0(m K+1 n). 


7.3 Algorithm 3. Near-OptimalGeneral 

For each process X,; there are {m — 1) directed information terms computed involving two 
processes, of the form I(X ? —y X,;). Next there are (m — 2) computed involving three 
processes, and so on. The complexity is thus 

K 

0(m^2(m — l)n|X|(* +1 ^ +1 ) = 0(m 2 Kn\X\^ K+1 ^ l+1 ). 

i= 1 

For constant K, this becomes O (m 2 n\. 
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7.4 Algorithm 4. Near-OptimalConnected 

For each ordered pair of processes (X.;, Xj), first there are (m —2) terms computed involving 
three processes, such as I(X^. —> Xj||Xj). Next there are (m — 3) computed involving four 
processes, and so on. Then a MWDST algorithm is called. The complexity is thus 


A'-l 


0(m 2 + m{m — 1) (m — 1 — i)n|X|^ +2 ^ +1 ) = 0(Km 3 n\X\^ K+1 ^ l+1 ). 


2=1 


For constant K, this becomes O (rn 3 n). 


7.5 Algorithms 5. TopRGeneral 

There are three main bottlenecks in generating the top-r solutions. The first is computing 
the directed information terms, the same as used in Algorithms 1 and 2, 0(m K+1 n) for fixed 
K. The second is sorting those values for each process X,;. Merge sort, for example, can sort 
an array of h elements in 0(h\ogh) time (Katajainen et ah, 1996). For each of the m pro¬ 
cesses, there are ('"^T 1 ) values, so sorting takes 0(m { rn K [ ) log { rn }< 1 )) = 0(m K+1 logm) 
for fixed K. The third bottleneck is the search for candidate solutions. Algorithm 6 
generates m new solutions each time it is called, replacing one parent set for each ap¬ 
proximation. The branching overall generates 0{rm ) candidates. For small r, such as 
r = 0 {log ) ) = 0{m log m K ) = O(mlogm) for fixed K, the computing and sort¬ 
ing the directed information values dominates. Recall that ( rn ^ 1 ) m is the total number of 
bounded in-degree approximations. For large r, such as r = ( m ^ 1 ) m / c f Q r some c > 1, 
then r = 0(m Km ) and so the branching dominates. The total complexity with fixed K is 
0(m K+1 (n + logm.) + rm ). 


7.6 Storage Complexity 

An important benefit of using approximations is that they require substantially less storage 
than the full joint distribution. The full joint distribution has mn random variables, and 
so requires 0(\X\ mn ) storage. Under Assumption [3j the storage complexity of the full joint 
distribution is C(n|X| m ^ +1 ^) and of approximations of the form ([T]) is 0(mn\X\ ( ' K+1 ^ l+1 ) = 
0(mn ) for fixed K. 


8. Simulations 

We investigated the performances of Algorithms 1-5 using simulated networks. Comparisons 
of greedy and optimal search, unconstrained and connected approximations, and the top-r 
approximations were studied. 

8.1 Greedy vs. Optimal Search 

We first compare the near-optimal and optimal approximations identified by Algorithm 3 
and Algorithm 1 respectively. 
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8.1.1 Setup 


Markov order-1 autoregressive (AR) networks, of the form X; = CX t _i+Nt, were simulated 
for a given m by rn coefficient matrix C and i.i.d. noise vector Nt. Two network sizes 
m E {6,15} were tested. For each m, there were 250 trials. In each trial, the coefficient 
matrix C was randomly generated. Edges (non-zero off-diagonal entries in C) were selected 
i.i.d. with probability 1/2. Non-zero AR coefficients were drawn i.i.d. from a standard 
normal distribution. C was then scaled to be stationary. The noise process {A/}/ =] had 
i.i.d. entries drawn from a normal distribution with mean zero and variance 1/4. Data was 
generated for n = 1000 time-steps. 

For each network, unconstrained bounded in-degree approximations {AoPT(*)}”ii and 
{Agrd(*)K=i were computed using Algorithms 1 and 3 respectively. For m = 6 and m = 15, 
in-degrees K = 2 and K = 4 were used respectively. Performance was measured by the 
ratio 


g£i gjgrt -**■ 

Eti i(X4 OPT(i , 


x. 


(14) 


The value of each sum corresponds to how good that approximation is. For (14), the 
directed information values were calculated exactly using approximated parent sets. 

Both algorithms computed directed information estimates using the simulated data. 
The estimate for a directed information of the form I(X —> Y||Z) was computed as follows. 
Least square estimates for the coefficients in two AR models, 


Yt — b\Yt-i + 62^-1 + b-jXt-i + Nt (15) 

Y t = b\Y t -\ + + N', (16) 


were computed. 
H(Y t |r t _ 1 ,Z t _ 1 , 


Let a and a' denote std (N t ) and std(N[) respectively. The entropy 
Xt- 1 ) is l/21og 2 (2vrea 2 ) (Cover and Thomas, 2006, Theorem 8.4.1), so 


^ 1 ^ ^ 

I(X -> Y||Z) = - V H(Y t |Y t _!, Zt^) - H(Y t |Y i _ 1 , Z t - X ,X t - X ) 

n t =1 

= 2 log(27re( ( j / ) 2 ) - ^ log(27recr 2 ) 

= log a /a. 


8.1.2 Results 


The approximations found by the greedy and optimal search were largely identical. Figure [3] 
shows histograms of percentages of the ratio (14), normalized by the number of trials. The 


rightmost column in each histogram corresponds to (14) being one, when the greedy search 
returned the optimal approximation. 

For m = 6 , the greedy search found the optimal solution in 96.4% of the trials. The 
average ratio was 99.9% ± 0.6%. The minimum ratio was 92.0%. 

For m = 15, the greedy search found the optimal solution in 57.6% of the trials. The 
average ratio was 99.6% ± 0.9%. The minimum ratio was 93.3%. 


By Theorem 14, the best case lower bound of (14) expected is 63.2%, which corresponds 


to a = 1. On average, the greedy algorithm performed much better than the lower bound, 
often the same as or close to the optimal. 
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Histogram of total Dl ratios between greedy 
and optimal, m = 6, K= 2 


Histogram of total Dl ratios between greedy 



85 90 

Ratio of total Dl 


(a) m = 6. 


(b) m = 15. 



80 85 90 95 100 

Ratio of total Dl 


Figure 3: Histograms of the relative performance of Algorithms 1 and 3 using the ratio (14). The 


right columns correspond to when both algorithms identified the same approximation. 


8.2 Comparison of Top-r Approximations for Algorithms 1,2,3,4 

We next investigated how well the approximations from Algorithms 1, 2, 3, and 4 compared 
to the true parent sets. We used modified versions of the algorithms to produce the top-r 
approximations for each class, as discussed in Section [6] (such as Algorithm 5). For the 
MWDST algorithm we used Choudhary (2009). 


8.2.1 Setup 


The setup was similar to that described in Section 8.1.1 For the approximations, there 


were multiple in-degree K values. For m = 6, in-degrees K G {1,2,4} were used, and for 
m = 15, I\ G {2,4,8} were used. Performance was measured by the ratio 

E;=i -)• X;) 


Y7=i IQU 


e(i) 


X, 


(17) 


where A{i) denotes a parent set induced by an approximation, and A^ vne (i) is the true 
parent set. The ratio © was calculated exactly. For each type of approximation, the top 
r = 10 approximations were found and performance was averaged across trials. 


8.2.2 Results 

Overall, the different approximations (unconstrained or connected, using optimal or greedy 
search) performed comparably for each of the (m, I\) pairs. The results are shown in 
Figure |4j The most noticeable variation in performance was due to different in-degree 
K values. For both m = 6 and m = 15, increasing K, especially when K was small, 
substantially improved performance. 

The connected approximations performed only slightly worse than the unconstrained. 
For larger m and K the difference diminished. For each (m, K) pair, on average the approx¬ 
imations returned by the greedy search performed almost as well as that of the optimal. 
This is consistent with the results in Section 8.1.2 (see Figure [3]). 
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Average Dl ratio for m = 6 and K = 1 Average Dl ratio for m = 6 and K = 2 



(a) m = 6, K = 1. (b) m = 6, K = 2. 


Average Dl ratio for m = 6 and K = 4 


| l I Qptfiftn IB nptGnn MfirriPipn ^Bfirrir.nn] 



r 


(c) m = 6, K = 4. 


Average Dl ratio for m = 15 and K = 2 Average Dl ratio for m = 15 and K = 4 Average Dl ratio for m = 15 and K = 8 



Figure 4: Plots of the average ratio of approximations to the true parent sets. Standard 
deviation error bars are shown. For each type of search, optimal (“Opt”) and greedy 
(“Grd”), and each type of approximation, unconstrained (“Gen”) and connected (“Con”), 
the top-r approximations are shown, with r £ {1, 5,10}. 


Performance did not decay appreciably among the top approximations. For m = 6 and 
K = 1, the performance difference between the first and tenth approximation is distinguish¬ 
able, but for others it is not. 


Figure [5] shows diagrams of the top r = 4 unconstrained and connected approxima¬ 
tions for a single trial with m = 6 and K = 2. Figures |5(a)| and 5(e) are the optimal 
unconstrained and connected approximations respectively. The figures to the right show 
differences between the optimal and rth best approximations for r £ {2, 3,4}. Dashed gray 
edges are those removed and solid black ones are edges included. For example, the third 
best unconstrained approximation, Figure 5(c), has all of the same edges as the optimal, 
Figure 5(a), except it has X 2 —> X 4 instead of Xi —y X 4 . 


For both the optimal unconstrained and connected approximations, the several next 
best approximations had only minor changes. Many edges were preserved. For this partic¬ 
ular trial, among the top four unconstrained approximations the only differences involved 
the parent of X 4 . For the connected approximations, the second, third, and fourth best 
approximations mostly varied for the parent of X 5 . The parents of Xi, X 2 , and X 3 were 
identical for all of the approximations shown. 
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(e) Connected, r = 1. 
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(b) General, r= 2. 
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(d) General, r = 4. 
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(f) Connected, r = 2. 

(g) Connected, r = 3. 

(h) Connected, r = 4. 


Figure 5: Diagrams of the top r = 4 unconstrained and connected approximations using optimal 
search for a single trial with m = 6 and K = 2. Figures 5(a) and 5(e) are the optimal 
unconstrained and connected approximations respectively. The figures to the right show 
differences between the optimal and rth best approximations. Dashed edges were removed 
and solid edges were included. 


8.3 Performance Decay for Large r 

We also investigated the decay in performance quality as r —> ( m ~ 1 ^ m 


8.3.1 Setup 


Using a simulation setup similar to Section 8.1.1 150 trials with m = 6 processes and in¬ 


degree K = 2 were run. Unconstrained bounded in-degree approximations were obtained 
using Algorithm 5 and a similarly modified Algorithm 3 to obtain optimal and greedy search 
orderings respectively. All approximations were computed, with r = = 10 6 . The 

ratio © was used to measure performance. 


8.3.2 Results 

Figure | 6 (a)| depicts the ratio ( fTr| ) for all approximations, ordered using Algorithm 5, for 
three trials. Due to estimation error, the actual ratio values did not decay monotonically. 
Figure [ 6 (b)| plots the estimated values. 

Figure [7] shows the true ratio ( |17[ ) value using the ordering returned by Algorithm 5, 
averaged over 150 trials. The mean and one standard deviation are represented by the black 
and green curves respectively. Consistent with the curves for individual trials in Figure | 6 j 
the shape of the curves in Figure [7] are similar to a logit function. There was a sharp 
decrease in the quality of approximation in the low and high r regimes, with a nearly linear 
decay for most r. 
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Dl ratio for all approximations in three 


trials with optimal search 



(a) Actual ratio values. 


Scaled Dl estimates for all approximations in 
three trials with optimal search 



(b) Estimates used by Algorithm 5, scaled. 


Figure 6: The values of the ratio © for all approximations in the order selected by Algorithm 5. 


Three trials are shown. Figure 
to estimation error. Figure [6(b) 
Figure 6 (a)| ). 


6 (a) shows the actual ratio values. The spread is due 
shows the estimated values (normalized to the scale of 



Figure 7: The ratio © for all approximations in the order selected by Algorithm 5, averaged over 
150 trials. The black and green curves depict the mean standard deviation respectively. 


The greedy search, using a modified Algorithm 3, performed comparably to the optimal 
search, Algorithm 5. For r = 1, this was shown in Figure [3j and for r < 10 this was 
shown in Figure [4| However, the current analysis confirmed that even for large r this was 
true. The analogous Figure [7] for the greedy ordering was visually indistinguishable and so 
not shown. However, effects of greedy ordering were clearly seen in some trials. Figure [ 8 ] 
depicts estimate values for approximations in one trial, with the light blue monotonic curve 
for the optimal ordering and the discontinuous black curve depicting the greedy ordering. 
The large jumps are characteristic of the depth-first search, as the worst-case along one 
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Scaled Dl estimates in one trial comparing 
optimal and greedy search 



Figure 8: The estimate values used by Algorithm 5 and a modified Algorithm 3 to rank uncon¬ 
strained approximations using optimal and greedy search. One trial is shown. Estimates 
are scaled to actual ratio values ©■ The smooth light blue curve corresponds to ordering 
from the optimal search. The black curve corresponds to the greedy search. 
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Figure 9: The difference between the ratio © for the actual optimal ordering of unconstrained 
approximations and the ordering returned by Algorithm 3 (using estimates). Results are 
averaged over 150 trials. The black curve depicts the mean. The green curves depict one 
standard deviation. 


branch is worse than the best-case of the next, although the former branch was initially 
more promising. However, such large discontinuities as shown in Figure [ 8 ] were rare overall. 
The ordering returned by the greedy search was largely similar to the optimal ordering. 

As illustrated in Figure [ 6 j estimation errors led to errors in the optimal search ordering 
returned by Algorithm 5. It is useful to characterize how large those ordering errors were. 
Figure [9] shows the error, measured by the difference of the ratio © between the actual 
rth best approximation and the rth approximation in the ordering returned by Algorithm 5 
for r < 1000. The results were averaged over 150 trials. In terms of percentage points of 
the sum of directed information of the true parent sets to children, the denominator of ©> 
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Algorithm 5 performed well despite the estimation error. The mean difference was slightly 
biased above 0 and the standard deviations were within 4% for all r < 1000. The slight 
bias in the mean was expected; for small r, the approximations were mostly replaced with 
worse approximations, if any. Overall, the greedy approximations performed nearly as well 
as the optimal approximations. 


9. Conclusion 

In this paper, we presented several novel methods related to approximating directed infor¬ 
mation graphs. The approximations allowed substantial flexibility for users. Each algorithm 
took the in-degrees of the nodes as input. With larger in-degrees, the approximations be¬ 
came better, but at the cost of visual simplicity of the graph and computational efficiency. 
The approximation could be unconstrained or connected, and found using an optimal or a 
more efficient, near-optimal greedy search. Furthermore, one could generate the top r solu¬ 
tions, not only the best. This enabled evaluation for which edges are most significant as well 
as finding the best solution of a more constrained class of topologies. Lastly, the empirical 
results demonstrated the utility of these methods, especially showing that on average the 
greedy search performed much better than the worst-case lower bound. 
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Appendix A. Proof of Theorem [9] 

Proof Let T be the set of all directed spanning trees on nn nodes. For a given tree TgT, 
let Qjt C Q k denote the set of approximations Px G Gk that contain T as a directed 
spanning tree subgraph. Every Px £ Gk contains at least one such T G T as a subgraph, 
so Gk = UtsT^x - 

For any tree T G T, the best approximation P^ G Gj( is the one that for every edge 
X„(j) — > Xj in T, sets A(i,a(i)) as the parent set for node i. This follows from ([9]), since 
a(i) G A(i,a(i)), so T will be a subgraph of this approximation, and the sets A(i,a(i)) are 
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the best such parent sets. Thus, 


.max 


max max 
Ter 


P ^k i=l 


X>X4 (i ) 


X, 


max^^X^.a^)) ->X,:), 
i— 1 


(18) 

(19) 


where (18) follows since Gk = [JreT^K an<4 (19) uses that P-J is the best approximation 
in G^- Algorithm 2 finds the solution to (19) and thus identifies the optimal approximation 

% e Qk- ■ 


Appendix B. Proof of Theorem 14 


The proof is based on the proof for a related bound for submodular functions (Nemhauser 


et ah, 1978). 


Proof For simplicity, we prove the case Af]B = 0. The other case is almost identical and 
results in a tighter bound. For that case the greedy algorithm selects each element of B 
before any element of A. Let l < |P| = L. Let Ai be the set A but ordered according to 
how the greedy algorithm would pick elements from A after picking {B(l),..., B(l)}. 

We first note two inequalities. For all l < L, 


I(X S ( i+ i) Y II^{B(1),...,B(Z)}) ^ I(Xa,(i) -a Y||X{s( 1 ),..., j b(0>), ( 20 ) 

which holds since the greedy algorithm selects X B ^ +1 ) after {X S (p,..., X B (^}, and 

a 1 1 I(X Ai(1) —> Y||X{B(i) i ... j s(qj) > I(X A; (j) —> Y||X{ B (i) ) ... ) B(z),Ai(i),...,A i (i-i)})> (21) 

which follows from Assumption [I] for the set A U {P(l), ..., B(l)}. 

We now compare an optimal solution A to the first l elements in the greedy solution B. 

I(X A —t Y) — I(X { b ( 1 ),...,u( 0} Y ) 

< I(X AU {b(i),...,b(0} “^Y) — I(X{b(i),...,b(Z)}—^Y) 

= I(X{B(1),...,B(Z)} Y) 

K 

+£ i ( x *,«->y 

l|Y{B(l),...,B(0}U{A(l),...,A i (i-l)}) 

i= 1 

—I(X{s(i),...,B(b} ^ Y) (22) 

K 

< Y a i - 1 I(X Aj(1) -»■ Y||X {fl(1)) ... iB(0} ) (23) 

i=l 

K 

< Y « i “ 1 I(XBd+D ->• Y||X { B(1),..., B (0})- ( 24 ) 

i— 1 


25 










Quinn, Pinar, and Kiyavash 


Equation (22) follows from the chain rule applied in the order the greedy algorithm would 
select from Au{.B(l), ... Equations (23) and (24) follow from (21) and (20) respec¬ 

tively. 

Let 5i := I(Xa “^ Y) —I(X{b(;l),...,_b(i)} ->■ Y). Then^—5/+i = I(X B ( i+1 ) —* YJIX^^^... )B (z)})- 
Also denote j3 := Ylf=i al1 - 


From (24) we have Si < (3 (Si — <5; + i), which implies < 


1 — jjj Si. Thus 


Si < 


1 - ^ I So < e 


p So- 


The last step uses the bound (1 — p) < e p , which holds for all p. For 0 < p < 1, both 
sides are positive so the inequality is conserved if powers are taken. Since So = I(Xa 
Y) - 1(0 -> Y) = I(X A -> Y), this gives 


I(X 4 -» Y) — I(X {s(1)i ... )B(0} Y) < e-4I(X 4 -> Y), 


which after rearranging gives the theorem. 


Appendix C. Proof of Corollary |15| 

We will prove Corollary [l5| by first solving the following optimization problem 


K 

max {bi,...,b K } ^2 b i 


(25) 

i= 1 

L 

s.t. YM<c 


(26) 

i— 1 

0 < bi < abi- 1 , i = 2 ,.. 

. ,m, 

(27) 


where K and L are integers such that K > L and a > 1 and c > 0 are real coefficients. Let 
denote a solution to ([25]). 


Lemma 20 For any optimal solution { 6 *,..., b * K }, (26) holds with equality. 


Proof The proof will follow by contradiction. Suppose c — Yli =i K > 0- 

H c -£f=A*)- Define 

T , = / K +7 if I <L, 

1 ' \ b* if i > L. 


Let 7 := 


Note that Yli=] b i = Yli=i &* + 7 = c so the first constraint is met. Also, for i < L, 
bi = b* + 7 < ab*_ 1 + 07 = abi-i, so the second constraint is met. Thus, { 61 ,..., bx} is 
feasible and has a larger sum than the optimal solution, YliLi b i = L 7 + YY=i Ki contra¬ 
dicting {bl,...,b* K y s optimality. ■ 
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Lemma 21 For any optimal solution {6^, ..., b* K }, (27) holds with equality. 


Proof The proof will follow by contradiction. Suppose there is an index i > 1 for which 
b* < a:b*_\. If i > L, then we can set b* <— ab*_ 1 to increase the objective function, which 

contradicts optimality. If i < L, replace 6*_ 1 and b* with bi-\ and bi, where 6j_i = 
and bi = Note that b*_ 1 + b* = 6*_i + bi, and the constraints are still satisfied. 

This exchange necessarily results in b* +1 < ab* < abi. Thus, the exchange can be repeated 
for larger i until i = L + 1. Then set &l+i <— and the objective function is necessarily 
increased, a contradiction. ■ 


We can now find the solution to the optimization problem. 
Lemma 22 The optimal solution to (25) is Ef=i K = c \ ■ 


Proof By Lemmas 20 and 21 the constraints (|26j) and (27) hold with equality. We can 
first solve for b*. 


c = 


K = 


E 4 ;=E a ‘~ lb 

i= 1 i =1 

C 


Eh a*- 1 ' 

Solving for the value of the objective function, 

K K K 


E 6 . - = = 


L ij _ 1 


2=1 2=1 

Using the geometric series formula 

K K -1 


2=1 


Eti« 


E« M = E 


K 


a = 


i =1 


i =0 


1 — CL 

1 — a ’ 


the equation (28) becomes 


K 


E" 

2=1 


sr^K 2—1 i 

Ei= i« = c 1 ~ a 


* C ^L 


K 


EEA-i 1-v 


(28) 


We can now prove Corollary |15[ 

Proof Let the elements of Ak and Al be ordered according to the greedy order. Consider 
the worst case, with I(Xa —> Y) as large as possible, given 

I(X { A«(i) . a kW ) -> Y) < I(X Al -» Y). (29) 


27 
















Quinn, Pinar, and Kiyavash 


The inequality (29) holds by definition of Al being the optimal parent set of size L. Greedy- 
submodularity imposes another constraint. For any 0 < i < K, 

I(Xa* (i+l) Y HX{A K (l),...,A K (i)}) < «!( *-A K (i) -> Y l|X{A K (l),..,AA'(i-l)})' 

Corollary 15 follows from Lemma |22l substituting I(X^ — > Y) for c and —> 


Y l|X{A K (l),...,Aif(*-!)}) f or bt 


Appendix D. Proof for Theorem 18 


Proof Let To denote the MWDST picked by Algorithm 2. For an edge e 6 {Xj —> 
X, : : 1 < j / i < m} in the complete graph on m nodes, let W2(e) denote the weight 
^(y) X j) assigned by Algorithm 2. Define T 4 and uq(e) for Algorithm 4 likewise. 

Also, let c := (1 — exp(—L/^^ q 1 a*))). For each edge e in the complete graph, 


104(e) > cw 2 (e), 

which follows from Theorem [14} Furthermore, 

X>4(e) > J2 w±{e) 

eST 4 eS T 2 

> c'^w 2 (e). 
eST 2 


(30) 


(31) 

(32) 


Equation (31) follows since in Algorithm 4, T 4 was selected as the MWDST, and (32) follows 
from (30). ■ 


Appendix E. Proof for Theorem 19 


To prove Theorem 19, we first show the following lemma. 


Lemma 23 For all 1 < l < r, the Ith best approximation has the same parent sets except 
one as one of the top l — 1 solutions. 


Proof The proof follows by induction. The base case, with 1 = 1, holds trivially as it is the 
only solution. Assume that the statement of the lemma holds for some 1 < l < r. Consider 
the (l + l)th best solution, {B[i)}F =l . Let Xj be a process for which the parent set B(j) is 
not the same as that of the optimal solution, A(j ). 

By Corollary [7j parent sets can be identified independently. Also, by Assumption [2j no 
two parent sets have the same influence. Thus, the optimal parent set is A(j) is better than 
B(j ) . Let A'(j) be any parent set for Xj that is better than B{j). Then, by Corollary [t| the 
parent sets {B(l),..., B(j — 1), A'(j), B(j + 1),... , B(m)} induce a better approximation 
than the (l + l)th best approximation with {B(i)} r fh 1 and therefore must be one of the top l 
approximations. Since this new approximation differs from the (Z + l)th best approximation 
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Algorithm 7. GetParSetIndex 
Input: m,idx 

1. cnt <r- 0 

2. K \idx\ 

3. UK = 0 

4. Return 0 

5. UK = 1 

6. Return idx( 1) — 1 

7. cnt <- e;5 w G?:l) 

8 . idx' •<— {zcte(2) — i<Lc(l),... ,idx(K) — idx(l)} 

9. cnt cnt + GetParSetIndex(m — idx(l),idx') 

10. Return cnt 


in precisely one parent set, the lemma holds. 


The proof for Theorem 19 follows from Lemma 23 since every approximation selected 
in Algorithm 5 is used as a seed in Algorithm 6 to generate all of the best solutions that 
have precisely one parent set different from that of the seed. 


Appendix F. Implementation Notes for Algorithm 5 

In Algorithm 5, for large m and r, a naive implementation of lines 6-8 can be computation¬ 
ally expensive. Specifically, redundant solutions can appear in S , and searching to remove 
redundancies or entries already in Top might be slow. Instead, S can be kept as a priority 
queue of value-key pairs, where the value is the sum of the directed information values ([ 8 ]) 
for the approximation as in line 6 and the key is the index of an approximation. There are 
( m ^ 1 ) possible bounded in-degree approximations, and a binary vector can track whether 
an approximation has been seen or not. 

We now discuss a method to compute an index for each approximation. First, indices 
for individual parent sets will be identified, then combined for an index for the whole 
approximation. Let {ji, • ■ •, jx} denote the elements of parent set A(i), in ascending 
order. For k E [K\, set jk <— jk — 1 if jk > *■ Denote the set of these (possibly) modified 
values by the length I\ vector idx. Then run Algorithm 7. 

Lines 7-9 in Algorithm 7 count how many parent sets of X,; are lexicographically ordered 
before A(i). Line 7 counts how many sets have a first element smaller than idx( 1). Lines 8-9 
use recursion to count how many sets with the same first element idx( 1 ) appear before idx. 

Once the index cq <— GetParSetIndex(m, idx ) for each parent set A(i) of an approxima¬ 
tion is calculated, the index for the approximation can be computed as 



i— 1 
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